Table of Contents

1. About the project
2. Import libraries and load data
3. Exploratory Data Analysis
4. Baseline model
5. Improvement over baseline
6. Final model

1. About the project

Banks are important institutions that provide funds, in the form of loans, for businesses and individuals to function and prosper. However, banks need money to provide as loans and to make their own investments (e.g. in stocks). One good source of such funds is the Term Deposits that the bank's customers make.

Banks regularly call their customers to secure such Term Deposits. However, out of the big list of all its customers, it would be wise to call those who are more likely to invest. This way, banks can reduce the cost of acquiring Term Deposits (payments to the staff making calls, call charges, and so on).

This project aims to build a machine learning model, trained on data collected from previous marketing campaigns, to predict which customers will potentially subscribe to a Term Deposit with the bank. Further, the prediction model will be hosted as a web service, which can accept customer data (in JSON format) and return the prediction (whether the customer is likely to subscribe to a Term Deposit).

The dataset

Citation: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

Data source: https://archive.ics.uci.edu/ml/datasets/bank+marketing (also available at https://www.openml.org/d/1461, with some differences due to additional processing of the data)

Datafile to be used: https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip

The bank-additional-full.csv (which has the complete dataset) is used for this project.
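The file can be read with pandas; note that the UCI CSVs are semicolon-separated with quoted values. A minimal sketch, using a tiny inline sample (hypothetical rows) in place of the real file:

```python
import io
import pandas as pd

# The UCI file is semicolon-separated with quoted values. A tiny inline
# sample (hypothetical rows) stands in for the real file here:
sample = io.StringIO(
    '"age";"job";"duration";"y"\n'
    '56;"housemaid";261;"no"\n'
    '37;"services";226;"yes"\n'
)
df = pd.read_csv(sample, sep=';')
print(df.shape)  # (2, 4)

# For the real dataset, after extracting bank-additional.zip:
# df = pd.read_csv('bank-additional/bank-additional-full.csv', sep=';')
```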

Notes from data source

Input variables:

bank client data: age, job, marital, education, default, housing, loan

related with the last contact of the current campaign: contact, month, day_of_week, duration

other attributes: campaign, pdays, previous, poutcome

social and economic context attributes: emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed

Output variable (desired target): y - has the client subscribed a term deposit? ('yes'/'no')

2. Import libraries and load data

back to TOC

3. Exploratory Data Analysis

back to TOC

This section performs various analyses of the dataset and splits it into training, validation and test sets.

3.1 EDA - Basic

Check if all the columns have correct data type (sometimes numerical columns are marked categorical or vice versa)

Observations: The columns seem to be correctly typed

Check for missing data

Observations: No missing data

Check if any of the numerical columns have significantly high values (sometimes NaNs are filled as something like 99999999)

Observations: Looking at the max for all the numerical features, we can see that there are no significantly high values for any of the features.
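The three checks above can be sketched as follows; the miniature frame here is a hypothetical stand-in for the real dataset:

```python
import pandas as pd

# Hypothetical miniature of the dataset, for illustration only
df = pd.DataFrame({
    'age': [56, 37, 40],
    'duration': [261, 226, 151],
    'job': ['housemaid', 'services', 'admin.'],
    'y': ['no', 'yes', 'no'],
})

print(df.dtypes)                   # 1) are columns typed as expected?
missing = df.isnull().sum()        # 2) any missing values per column?
print(missing)
maxima = df.describe().loc['max']  # 3) any suspiciously high numeric values?
print(maxima)
```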

For categorical features, check cardinality (distinct values and their count)

Observations: The categorical features do not have high cardinality - the distinct-value counts for the categorical columns look ok.
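Cardinality of the categorical columns can be checked with nunique and value_counts; a sketch on a hypothetical miniature frame:

```python
import pandas as pd

df = pd.DataFrame({
    'job': ['housemaid', 'services', 'admin.', 'services'],
    'marital': ['married', 'single', 'married', 'married'],
})  # hypothetical sample rows

# Distinct values per categorical column...
cardinality = df.select_dtypes(include='object').nunique()
print(cardinality)               # job: 3, marital: 2
# ...and the count of each value
print(df['job'].value_counts())
```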

Converting target variable from having 'yes'/'no' to 1/0

Observations: There is a high class imbalance in the target variable - 89% 'no' vs 11% 'yes'
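The conversion itself is a one-liner, and the mean of the resulting 0/1 column gives the subscription rate (~11% on the real data). A sketch on a hypothetical sample:

```python
import pandas as pd

df = pd.DataFrame({'y': ['no', 'no', 'yes', 'no']})  # hypothetical sample

# Map 'yes'/'no' to 1/0
df['y'] = (df['y'] == 'yes').astype(int)

rate = df['y'].mean()  # share of the positive class
print(rate)  # 0.25 on this toy sample; ~0.11 on the real data
```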

Many of the features have the value 'unknown', possibly because that piece of information was not available. Checking how many records have at least one 'unknown' and whether removing them makes sense.

Observations: In total there are 10700 records with at least one feature having the value 'unknown'. This is 35% of the total data, so it does not make sense to remove these records. We will go ahead with the 'unknown' values.
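Counting rows that contain at least one 'unknown' can be done with a row-wise any; a sketch on a hypothetical sample:

```python
import pandas as pd

df = pd.DataFrame({
    'job': ['unknown', 'services', 'admin.'],
    'education': ['basic.4y', 'unknown', 'high.school'],
})  # hypothetical sample

has_unknown = (df == 'unknown').any(axis=1)
print(has_unknown.sum())             # rows with at least one 'unknown'
print(round(has_unknown.mean(), 2))  # as a fraction of all rows
```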

Look at the distribution of numerical features

Observations: We can see that the numerical features 'duration', 'campaign', 'pdays' and 'previous' are not normally distributed. We will use this information for additional EDA and when performing experiments with models and scores

3.2. EDA - additional

back to TOC

Splitting data as Train (70%), Val (20%), Test (10%)
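One way to get a 70/20/10 split with scikit-learn is two successive train_test_split calls (a sketch; the random_state is illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'x': range(100), 'y': [0, 1] * 50})  # hypothetical data

# First carve off 10% as test; then take 2/9 of the remaining 90% as validation,
# which yields 70/20/10 proportions overall.
df_full_train, df_test = train_test_split(df, test_size=0.1, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=2/9, random_state=42)

print(len(df_train), len(df_val), len(df_test))  # 70 20 10
```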

Noting down which transformations should be tried for which features:

Feature Importance

Mutual information with categorical features

Observations: We can see that the features marital, day_of_week, housing and loan seem to not have any bearing on the outcome. Will check the score with and without these features.
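Mutual information per categorical feature can be computed with sklearn's mutual_info_score; a sketch on a hypothetical sample where 'poutcome' perfectly determines the target and 'marital' carries no information:

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

df = pd.DataFrame({
    'poutcome': ['success', 'failure', 'nonexistent', 'success'],
    'marital':  ['married', 'single', 'married', 'single'],
    'y':        [1, 0, 0, 1],
})  # hypothetical sample

# Mutual information (in nats) between each categorical feature and the target
mi = {col: mutual_info_score(df[col], df['y']) for col in ['poutcome', 'marital']}
print(sorted(mi.items(), key=lambda kv: -kv[1]))  # 'poutcome' ranks first here
```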

Check global subscription rate relation to subscription rate feature wise

Co-relation of numerical features with target variable

Observations: We can see that most of the numerical features correlate with the target, unlike the categorical variables, where almost all had low mutual information scores with the target. The features age, duration, previous and cons.conf.idx have a positive correlation, while campaign, pdays, emp.var.rate, cons.price.idx, euribor3m and nr.employed have a negative correlation
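Pandas' corrwith computes these correlations directly; a sketch on a hypothetical sample:

```python
import pandas as pd

df = pd.DataFrame({
    'age':      [30, 40, 50, 60],
    'campaign': [4, 3, 2, 1],
    'y':        [0, 0, 1, 1],
})  # hypothetical sample

# Pearson correlation of each numerical column with the 0/1 target
corr = df[['age', 'campaign']].corrwith(df['y'])
print(corr)  # age positive, campaign negative in this toy sample
```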

To check absolute correlations (ignoring whether they are positive or negative and simply looking at how strong they are)

Observations: All the numerical features seem to be important for predicting the outcome.

Important Note: The duration will not be considered in the model performance and the final model, based on the guidelines that were given with the dataset and mentioned below.

Let's look at correlations of numerical features with other numerical features

Observations: We can see that some of the features are highly correlated with each other (>0.75) - emp.var.rate, nr.employed, euribor3m and cons.price.idx. This makes sense, as all of these are social and economic context attributes.

4. Baseline model

back to TOC

We will check the performance of a baseline model using LogisticRegression with default parameters, evaluating with roc_auc_score (since the target variable has class imbalance). We will check the scores with and without the feature 'duration'.
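A minimal sketch of such a baseline - one-hot encode the categorical features, fit a default LogisticRegression, and score with roc_auc_score (hypothetical miniature data; the real notebook uses the full feature set):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the real train/validation frames
df = pd.DataFrame({
    'age': [30, 40, 50, 60, 35, 45, 55, 25] * 5,
    'job': ['admin.', 'services'] * 20,
    'y':   [0, 0, 1, 1, 0, 1, 1, 0] * 5,
})
X, y = df[['age', 'job']], df['y']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

# One-hot encode categoricals, pass numericals through, fit default LogisticRegression
pre = ColumnTransformer(
    [('cat', OneHotEncoder(handle_unknown='ignore'), ['job'])],
    remainder='passthrough',
)
model = make_pipeline(pre, LogisticRegression())
model.fit(X_train, y_train)

auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(round(auc, 3))
```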

Score with feature 'duration'

Removing the feature 'duration' and checking the score

Observations: We will consider this score (without the 'duration' feature) as the baseline score. We will now see how we can improve the score. Later we will try using different models and then again see for those, how we can further improve the score by hyper-parameter tuning.

Deleting 'duration' feature from all the dataframes

5. Improvement over baseline

back to TOC

The idea here is to run several experiments using the baseline algorithm and find methods to improve the score, and then tune the parameters of this model to improve it further. After working with this model, we will look at other algorithms and compare the scores, and tune the hyperparameters for those models as well. Finally, we will compare all the results to select the best model and parameters, which we will then use to train on the full_train dataset and do the final evaluation on the test dataset.

5.1 Logistic Regression

5.1.1 Experiments to improve the score using LogisticRegression
5.1.2 Model tuning for LogisticRegression

5.2 DecisionTreeClassifier

Compare score of using DecisionTree with baseline using LogisticRegression

5.2.1 Experiments to improve the score using DecisionTreeClassifier
5.2.2 Model tuning for DecisionTreeClassifier

5.3 RandomForestClassifier

Compare score of using RandomForestClassifier with baseline using LogisticRegression

5.3.1 Experiments to improve the score using RandomForestClassifier
5.3.2 Model tuning for RandomForestClassifier

5.4 XGBoost

Compare score of using XGBoost with baseline using LogisticRegression

5.4.1 Experiments to improve the score using XGBoost
5.4.2 Model tuning for XGBoost

5.1 Logistic Regression (continued...)

back to TOC

5.1.1 Experiments to improve the score using LogisticRegression

back to TOC

Linear models work best when all the features have a similar scale. Let us check whether scaling the numerical features helps increase the score, using StandardScaler.

Observations: There is considerable improvement in score after scaling using StandardScaler. Let us check results with other scaling (MinMaxScaler) and preprocessing (Polynomial)

Observations: There is a further slight improvement in score with MinMaxScaler compared to StandardScaler.
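The two scalers differ only in the statistics they normalize by; a minimal sketch on a hypothetical numeric matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[30.0, 100.0], [40.0, 300.0], [50.0, 999.0]])  # hypothetical numeric features

X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
X_mm = MinMaxScaler().fit_transform(X)     # each column rescaled to [0, 1]

print(X_std.mean(axis=0))                   # ~[0, 0]
print(X_mm.min(axis=0), X_mm.max(axis=0))   # [0, 0] [1, 1]
```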

Let us create polynomial features and check score

Observations: The score after using polynomial features has decreased and is even less than the baseline (here we kept the original features as well as the polynomial features). Let us check replacing the original numerical features with the corresponding polynomial features.

Observations: Replacing the original features with polynomial features gets a better result, but it is still less than the baseline.

Check scores with various transformations on numerical features that we noted down during additional EDA

Observations: From the experiments so far with various transformations on numerical features, we got the best score with MinMaxScaler, then with log1p of 'pdays', then with StandardScaler. Since we decided not to use MinMaxScaler, and since the score of simply using StandardScaler without any other transformations is equally good (a very small difference), we will choose this transformation for further experiments.

Let us look at the coefficients of the trained Logistic regression model to see which are the features that do not help much (coefficient values close to 0)

Based on our EDA, we had seen that the features 'loan', 'housing', 'day_of_week' and 'marital' had very little mutual information with the target. Let us check the score of a model trained without these features.

Observations: There is only a tiny difference in the scores with and without the features 'loan', 'housing', 'day_of_week', 'marital', so the model confirms our observations in EDA. The score also seems to have slightly improved after removing the features.

Evaluate effect on score by dropping one feature at a time

Check the difference in scores: a positive difference indicates the score improved on removing the feature (the feature is unnecessary and hurting the model); a negative difference indicates the score dropped on removing the feature (the feature is useful); the magnitude indicates how important the feature is.
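The drop-one-feature-at-a-time evaluation can be sketched like this (hypothetical miniature frames and feature names; the real experiment loops over the full feature list):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical numeric-only train/validation frames
train = pd.DataFrame({'a': [0, 1, 2, 3] * 10, 'b': [1, 0, 1, 0] * 10, 'y': [0, 0, 1, 1] * 10})
val = pd.DataFrame({'a': [0, 1, 2, 3] * 3, 'b': [0, 1, 0, 1] * 3, 'y': [0, 0, 1, 1] * 3})
features = ['a', 'b']

def auc_with(cols):
    """Fit on the given columns and return the validation AUC."""
    model = LogisticRegression().fit(train[cols], train['y'])
    return roc_auc_score(val['y'], model.predict_proba(val[cols])[:, 1])

base = auc_with(features)
# Positive diff: score improved without the feature; negative: the feature helps
diffs = {f: auc_with([c for c in features if c != f]) - base for f in features}
print(base, diffs)
```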

Observations: We can see the following:

Check which features result in the least change in score from the baseline, irrespective of whether the difference is positive or negative

Observations: Deleting any of the features has very minimal effect on score [less than 0.008 - 0.0015]

We will now experiment with dropping groups of features (features whose removal increases the score) and see the effect on the score. We will select the top 8 features that have a positive effect when removed.

Observations: Removing combinations of features gives better scores than removing a single feature, and almost all of these experiments score very well in comparison with the baseline.

Top 10 ranking experiments are:

5.1.2 Model tuning for LogisticRegression

back to TOC

Tuning the parameters using GridsearchCV

Tuning parameters using the top experiment: Use MinMaxScaler to transform numerical features, Delete features 'day_of_week', 'education', 'cons.conf.idx', 'age', 'job'
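A GridSearchCV sketch for LogisticRegression (the parameter grid here is illustrative, not the notebook's actual grid; synthetic data stands in for the real frames):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=5, random_state=42)  # stand-in data

# Illustrative grid over regularization strength and solver
param_grid = {'C': [0.01, 0.1, 1, 10], 'solver': ['liblinear', 'lbfgs']}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    scoring='roc_auc', cv=5)
grid.fit(X, y)

print(grid.best_params_, round(grid.best_score_, 3))
```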

5.2 DecisionTreeClassifier

back to TOC

Observations: With DecisionTree the score is worse than with LogisticRegression - this might be due to overfitting. To check for overfitting, we will evaluate the score on the training data itself.

Observations: Indeed there is overfitting, which is why the previous experiment did not score well. This is due to max_depth defaulting to None (unlimited depth), which lets the DecisionTree almost memorize the training data. This is also why the baseline score with DecisionTree was bad.

Let us set max_depth to 4 and take a new baseline. For further experiments we can then keep this same value for max_depth.
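The overfitting diagnosis above can be reproduced on synthetic data: an unlimited-depth tree scores (near-)perfectly on its own training data, while capping max_depth narrows the train/validation gap. A sketch:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (flip_y adds label noise so overfitting is visible)
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=1)

results = {}
for depth in [None, 4]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    auc_tr = roc_auc_score(y_tr, tree.predict_proba(X_tr)[:, 1])
    auc_val = roc_auc_score(y_val, tree.predict_proba(X_val)[:, 1])
    results[depth] = (auc_tr, auc_val)
    print(depth, round(auc_tr, 3), round(auc_val, 3))
# max_depth=None memorizes the training data (train AUC ~1.0);
# max_depth=4 trades a little train AUC for a smaller train/val gap.
```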

Observations: Now we have a comparatively better score, though it is slightly less than the LogisticRegression baseline.

Let us look at how the DecisionTree made its decisions

5.2.1 Experiments to improve the score using DecisionTreeClassifier

back to TOC

Based on our EDA, we had seen that the features 'loan', 'housing', 'day_of_week' and 'marital' had very little mutual information with the target. Let us check the score of a model trained without these features.

Observations: With DecisionTree the difference in the scores with and without the features 'loan', 'housing', 'day_of_week', 'marital' is very small, thus confirming our observations in EDA.

Let us evaluate scores by transforming the numerical features

Observations: From the experiments so far, only the score with polynomial features is better than the (new DecisionTree) baseline. Also, since the scores with polynomial features added and with the numerical features replaced by polynomial features are the same, we will choose the replacement option, as it reduces the number of features to train on. We will use this method for further experiments.

Evaluate effect on score by dropping one feature at a time

Observations: The score is higher when dropping any single feature, except for 'contact' and 'pdays', where the score decreases when the feature is dropped. Let us check these scores against the score when using polynomial features.

We can see that, compared to the base score with polynomial features, the score is higher only when dropping one of 'month', 'campaign', 'cons.conf.idx', 'age', 'previous'. Let us check if dropping a combination of these features further increases the score.

Observations: Deleting multiple features among ['month', 'campaign', 'cons.conf.idx', 'age', 'previous'] has increased the score by a good margin. We will choose the combination that led to the highest score and then perform parameter tuning.

Observations: Deleting all the features 'month', 'campaign', 'cons.conf.idx', 'age', 'previous' has led to the best score. We will now do parameter tuning and see the results.

5.2.2 Model tuning for DecisionTreeClassifier

back to TOC

Tuning the parameters using GridsearchCV

5.3 RandomForestClassifier

back to TOC

Observations: The baseline score using RandomForest is slightly lower than the baseline using LogisticRegression

5.3.1 Experiments to improve the score using RandomForestClassifier

back to TOC

Observations: After deleting the least important features as per EDA, with RandomForest there is a slight impact on the score compared to baseline (versus the minimal impact seen with LogisticRegression or DecisionTree)

Let us evaluate scores by transforming the numerical features

Observations: From the experiments so far, using polynomial features increased the score compared to the RandomForest baseline, although it is still less than the main baseline score (with LogisticRegression). Other transformations led to decreased scores. Thus we will use polynomial features for further experiments.

Evaluate effect on score by dropping one feature at a time

Observations: When compared with polynomial features using randomforest:

deleting one feature has positive impact (increased score) for features - 'cons.price.idx', 'emp.var.rate', 'cons.conf.idx', 'housing', 'previous', 'nr.employed', 'month', 'pdays', 'education', 'poutcome'

while deleting one feature has negative impact (reduced score) for features - 'default', 'day_of_week', 'loan', 'marital', 'contact', 'age', 'campaign', 'job', 'euribor3m'

Let us experiment with removing a group of features that affected positively.

Observations: Dropping multiple features that individually had a positive impact improves the score for several combinations. We will choose the top-scoring combination.

Will tune parameters and check the results

5.3.2 Model tuning for RandomForestClassifier

back to TOC

5.4 XGBoost

back to TOC

Observations: The score using XGBoost without any feature engineering or parameter tuning is better than the baseline. Let us check how the score improves or degrades with a higher number of training rounds.

We can see that while the score on validation data improves a bit initially, it soon starts decreasing, while the score on training data continues to increase as the number of rounds grows (this is overfitting). Let us check up to what number of rounds the score actually increases.

Observations: We can see that around num_boost_round 3 the score is highest and then it starts decreasing (due to possible overfitting). We will use num_boost_round=3

Observations: Limiting num_boost_round to 3 has increased the score

5.4.1 Experiments to improve the score using XGBoost

back to TOC

Based on our EDA, we had seen that the features 'loan', 'housing', 'day_of_week' and 'marital' had very little mutual information with the target. Let us check the score of a model trained without these features.

The difference in score is small after removing the least important features as per EDA, which confirms our EDA observation

Let us evaluate scores by transforming the numerical features

Observations: From the experiments so far, using polynomial features increased the score compared to the XGBoost baseline; moreover, all the XGBoost experiments score better than the baseline with LogisticRegression. Other transformations led to decreased scores. Thus we will use polynomial features for further experiments.

Evaluate effect on score by dropping one feature at a time

Observations: The score is higher when dropping any single feature. Let us check these scores against the score when using polynomial features.

Observations: We can see that when compared to base score with polynomial features, score is higher only when dropping any of 'job', 'pdays', 'education'. Let us check if dropping any combination of these features further increases the score.

Observations: Dropping 'job', 'pdays', 'education', ('job', 'pdays') or ('pdays', 'education') gives better scores, with dropping only 'job' giving the highest. The scores are almost identical for these top experiments, so any of them could be used. We will choose dropping 'job' and perform parameter tuning.

5.4.2 Model tuning for XGBoost

back to TOC

Useful references

6. Final model

back to TOC

6.1 Compare results from hyper-parameter tuning for the different models and choose final model

back to TOC

Let's find the top 4 scores of each model

Using Plotly since it provides interactive graphs - so that we can see what the values for the points of our interest are.

Note: Plotly graphs appear blank when viewing previously run notebook

Observations: We can see that the 'mean_test_score' of 0.803311, with 'std_test_score' of 0.005255258, is the overall best score with the least std. deviation. So we will check which algorithm and which parameters got this score. From the df_cons_scores sorted above, we can see that it is the 2nd entry from the top (index 241)

Thus, XGBClassifier with parameters {'booster': 'gbtree', 'colsample_bytree': 0.4, 'eval_metric': 'auc', 'learning_rate': 0.01, 'max_depth': 3, 'min_child_weight': 5, 'n_estimators': 3000, 'objective': 'binary:logistic', 'random_state': 42} got us the best score. We will now train our final model using this.

Note: Plotly graphs appear blank when viewing previously run notebook

Observations: We can see that around iteration 335 the score is 0.79351; it then drops, and from iteration 1286 it starts increasing again up to iteration 2330, where it is highest at 0.79461 (an increase of only 0.0011 over iteration 335), after which it is fairly stable. Thus, for practical purposes, we will train up to iteration 335, as it takes less compute/time and still achieves a near-optimal score.

Performing one last training with 335 iterations and validating on the validation dataset, before training the final model on the full_train dataset

Let us have a look at various metrics - tp, tn, fp, fn, Precision, Recall, F1 score, TPR, FPR, AUC, Accuracy etc. - to determine which threshold should be used to make the decision on the prediction.
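Computing these per-threshold metrics only needs the predicted probabilities and true labels; a sketch with hypothetical values:

```python
import numpy as np

# Hypothetical predicted probabilities and true labels
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.55])

rows = []
for t in [i / 10 for i in range(1, 10)]:  # thresholds 0.1 .. 0.9
    pred = (y_prob >= t).astype(int)
    tp = int(((pred == 1) & (y_true == 1)).sum())
    fp = int(((pred == 1) & (y_true == 0)).sum())
    fn = int(((pred == 0) & (y_true == 1)).sum())
    tn = int(((pred == 0) & (y_true == 0)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    rows.append((round(t, 1), tp, fp, fn, tn, round(f1, 3)))

for r in rows:
    print(r)  # (threshold, tp, fp, fn, tn, f1)
```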

Note: Plotly graphs appear blank when viewing previously run notebook

Observations: In an ideal scenario we would like our predictions to always be correct, i.e. have fp and fn equal to 0. Practically this is not possible. We will see which threshold makes the most sense for our predictions, such that we have comparatively acceptable fp and fn.

Note: Plotly graphs appear blank when viewing previously run notebook

Observations:

Looking at the F1 scores and at what threshold is our F1 score the highest

Looking at the TPR and FPR

Looking at the Precision and Recall for our model

Plotting the ROC curve

6.2 Train final model

back to TOC

Training on full_train dataset for final model

Save model to disk
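Saving with pickle is one common option (sketched here with a generic scikit-learn model on synthetic data; the notebook's final model is the tuned XGBoost one, and the file name 'model.bin' is an assumption):

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=0)  # stand-in for full_train
model = LogisticRegression(max_iter=1000).fit(X, y)

# Persist the trained model so the web service can load it later
with open('model.bin', 'wb') as f_out:
    pickle.dump(model, f_out)

# Round-trip check: the reloaded model predicts identically
with open('model.bin', 'rb') as f_in:
    loaded = pickle.load(f_in)

print((loaded.predict(X) == model.predict(X)).all())
```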

END of Notebook